Discriminative Clustering of Text Documents

نویسندگان

  • Jaakko Peltonen
  • Janne Sinkkonen
  • Samuel Kaski
چکیده

Vector-space and distributional methods for text document clustering are discussed. Discriminative clustering, a recently proposed method, uses external data to find taskrelevant characteristics of the documents, yet the clustering is defined even with no external data. We introduce a distributional version of discriminative clustering that represents text documents as probability distributions. The methods are tested in the task of clustering scientific document abstracts, and the ability of the methods to predict an independent topical classification of the abstracts is compared. The discriminative methods found topically more meaningful clusters than the vector space and distributional clustering models.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

TCUAP: A Novel Approach of Text Clustering Using Asymmetric Proximity

Text documents have sparse data spaces and current existing methods of text clustering use symmetry proximity to measure the correlation of documents. In this paper, we propose a novel approach to strengthen the discriminative feature of document objects, which uses asymmetric proximity for text clustering. We present a measure of asymmetric proximity between documents and between clusters. TCU...

متن کامل

Text Document Clustering Using DPM with Concept and Feature Analysis

Clustering is one of the most important techniques in machine learning and data mining tasks. Similar documents are grouped by performing clustering techniques. Similarity measuring is used to determine transaction relationships. Hierarchical clustering model produces tree structured results. Partitioned based clustering produces the outcome in grid format. Text documents are unstructured data ...

متن کامل

Concept Chain Based Text Clustering

Different from familiar clustering objects, text documents have sparse data spaces. A common way of representing a document is as a bag of its component words, but the semantic relations between words are ignored. In this paper, we propose a novel document representation approach to strengthen the discriminative feature of document objects. We replace terms of documents with concepts in WordNet...

متن کامل

Locally discriminative topic modeling

Topic modeling is a powerful tool for discovering the underlying or hidden structure in text corpora. Typical algorithms for topic modeling include probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). Despite their different inspirations, both approaches are instances of generative model, whereas the discriminative structure of the documents is ignored. In this p...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002